RAG

RAG / Retrieval-Augmented Generation

RAG is a technique in which an application fetches data from an external source (e.g. vector database, SQL DB, graph DB, web search engine) and feeds it to an LLM (Large Language Model) as part of the generation process.
Purpose of RAG? To give the LLM more context, so that its answers are grounded in relevant data instead of only in its training data.
What is a vector?
What is an Embedding Model?

RAG Pipeline


1. [Retrieval Phase] Chunks are fed into vector DB
             |-------------------- A. Retrieval Phase (Offline) --------------|
             |                                                                |
Raw          |  |--- Chunker ---|                                             |
documents->  |- | break docs in |--chunks->[Embedding]-vectors--> [vectorDB]  |
logs         |  |smaller pieces |          [  Model  ]                 |      |
             |  |---------------|                                      \/     |
             |                                                       index    |
             |----------------------------------------------------------------|

             [Node1 (score 0.92), Node2 (score 0.87), Node3 (score 0.76)]
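
The scores attached to each node are similarity scores between the query vector and the stored chunk vectors. A minimal sketch of how such a ranking can be computed is shown below, assuming cosine similarity as the metric. The names cosine_similarity, top_k and toy_index are hypothetical, and the 3-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=3):
    """Return the k (name, score) pairs most similar to query_vec, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy vectors standing in for embedded chunks.
toy_index = {
    "Node1": [1.0, 0.0, 0.0],
    "Node2": [0.6, 0.8, 0.0],
    "Node3": [0.1, 1.0, 0.0],
    "Node4": [0.0, 0.0, 1.0],
}
query = [1.0, 0.0, 0.0]  # hypothetical embedded user query

for name, score in top_k(query, toy_index):
    print(f"{name} (score {score:.2f})")
```

A production vector DB typically uses an approximate nearest-neighbour index rather than this exhaustive loop, but the ranking idea is the same.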

2. [Augmentation Phase] User asks a query & relevant information is retrieved from the vector DB
User's Query: Show firewall policies blocking outbound traffic?

        index from vectorDB(Step1)
          |
          \/
        |------B. Augmentation Phase --------|
User    | index + User's Query = Prompt      | 
query-> |                                    |
        | Combine index vector DB            |
        | into well crafted prompt           |-> augmented_prompt
        |------------------------------------|

augmented_prompt=
"Context: 
[Node1][Node2][Node3] 
Question: Show firewall policies blocking outbound traffic
Answer:"

3. [Generation Phase] Feed augmented_prompt into LLM.
With (user_query + retrieved context) in the prompt, LLM hallucinations are typically reduced

                     |-- LLM --|
augmented_prompt --> | GPT5.0  | --> Response (fewer hallucinations)
                     |---------|

1. The Retrieval Phase:
Chunking: Raw documents are broken down into smaller, readable pieces.
Embedding: These text chunks are converted into mathematical
representations (vectors) using an embedding model.
Vector Search: When the user asks a question, the system embeds the query and searches the vector DB for the most similar chunks.
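
The chunking step can be sketched for line-oriented logs as below. This is only a minimal illustration: chunk_lines is a hypothetical helper, and library splitters are usually more elaborate (for example, working on sentences or token counts). The overlap keeps neighbouring chunks sharing some lines, so that an event split across a chunk boundary is still seen together in at least one chunk.

```python
def chunk_lines(text, lines_per_chunk=2, overlap=1):
    """Split text into chunks of up to lines_per_chunk non-empty lines.

    Consecutive chunks share `overlap` lines.
    """
    if not 0 <= overlap < lines_per_chunk:
        raise ValueError("overlap must be >= 0 and smaller than lines_per_chunk")
    lines = [line for line in text.splitlines() if line.strip()]
    step = lines_per_chunk - overlap
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + lines_per_chunk]))
        # Stop once the current window has reached the last line.
        if start + lines_per_chunk >= len(lines):
            break
    return chunks


# Sample lines taken from the log files used later in these notes.
sample = """\
2025-05-01 VPN_LOGIN_FAILED user=john.doe ip=185.22.11.4
2025-05-01 FIREWALL_DENY src=10.1.1.5 dst=8.8.8.8 policy=OUTBOUND_BLOCK
2025-05-01 FIREWALL_DENY src=10.1.1.6 dst=1.1.1.1 policy=OUTBOUND_BLOCK
"""

for number, chunk in enumerate(chunk_lines(sample), start=1):
    print(f"--- chunk {number} ---")
    print(chunk)
```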

2. The Augmentation Phase:
Once the relevant information is retrieved, it isn't just displayed. It is packaged.
The system takes the user’s original query and the retrieved
text chunks, combining them into a specifically crafted prompt
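
The packaging step can be sketched as follows. The function name build_augmented_prompt is hypothetical, and the template simply mirrors the Context/Question/Answer layout used in these notes; it is not the exact prompt any particular library sends.

```python
def build_augmented_prompt(user_query, retrieved_chunks):
    """Combine retrieved text chunks and the user's query into one prompt string."""
    context = "\n".join(retrieved_chunks)
    return f"Context:\n{context}\nQuestion: {user_query}\nAnswer:"


# Chunks as they might come back from the vector search (taken from the sample logs).
retrieved = [
    "2025-05-01 FIREWALL_DENY src=10.1.1.5 dst=8.8.8.8 policy=OUTBOUND_BLOCK",
    "2025-05-01 FIREWALL_DENY src=10.1.1.6 dst=1.1.1.1 policy=OUTBOUND_BLOCK",
]
prompt = build_augmented_prompt(
    "Show firewall policies blocking outbound traffic", retrieved
)
print(prompt)
```

Note that the prompt carries the text of the retrieved chunks, not their vectors; the vectors are only used to find the chunks.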

3. The Generation Phase:
This combined prompt (the user's query + the retrieved
context) is fed into the LLM.
This steers the model to synthesize an answer from
the provided external data, which typically reduces hallucinations (it does not eliminate them)

RAG Flow

@startuml

actor admin as admin 
box Retrieval Phase #LightCyan
participant em as "from llama_index.embeddings.openai \nimport OpenAIEmbedding\n\nEmbedding Model"
participant vdb as "from llama_index.core \nimport VectorStoreIndex\n\nVector DB"
end box

box Augmentation Phase #LightYellow
participant rs as "class ResponseSynthesizer\n inside query_engine"
end box

box Generation Phase #Pink
participant llm as "from llama_index.llms.openai \nimport OpenAI\n\nLLM"
end box

actor User as u

admin -> em: Feed Raw documents(data)
note over em
Create Tensors/Vectors
end note
em -> vdb: vectors


u -> rs: user_query
vdb -> rs: vectors/nodes
activate rs
rs --> rs: Create augmented_prompt \n augmented_prompt=\n  (Context: retrieved nodes + user_query)
deactivate rs

rs -> llm: augmented_prompt
llm -> u: Response of query

@enduml

RAG Pipeline Code

User queries from security logs

We have log files (e.g. VPN, firewall).
The RAG pipeline reads the log files and answers the administrator's questions.


./logs/vpn.log
2025-05-01 VPN_LOGIN_FAILED user=john.doe ip=185.22.11.4
2025-05-01 VPN_LOGIN_FAILED user=john.doe ip=185.22.11.4

./logs/firewall.log
2025-05-01 FIREWALL_DENY src=10.1.1.5 dst=8.8.8.8 policy=OUTBOUND_BLOCK
2025-05-01 FIREWALL_DENY src=10.1.1.6 dst=1.1.1.1 policy=OUTBOUND_BLOCK
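
To reproduce the example locally, the two sample files can be created with a short script such as the one below. The helper name write_sample_logs is hypothetical; the file names and log lines are the ones listed above.

```python
from pathlib import Path

# Sample log lines copied from the listings above.
SAMPLE_LOGS = {
    "vpn.log": [
        "2025-05-01 VPN_LOGIN_FAILED user=john.doe ip=185.22.11.4",
        "2025-05-01 VPN_LOGIN_FAILED user=john.doe ip=185.22.11.4",
    ],
    "firewall.log": [
        "2025-05-01 FIREWALL_DENY src=10.1.1.5 dst=8.8.8.8 policy=OUTBOUND_BLOCK",
        "2025-05-01 FIREWALL_DENY src=10.1.1.6 dst=1.1.1.1 policy=OUTBOUND_BLOCK",
    ],
}


def write_sample_logs(log_dir="./logs"):
    """Create log_dir if needed and write the sample log files into it."""
    target = Path(log_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for name, lines in SAMPLE_LOGS.items():
        path = target / name
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    for path in write_sample_logs():
        print(f"wrote {path}")
```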

$ cat rag_pipeline.py
import os
import dotenv
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI  # LLM client
from llama_index.core import Settings

# Load GITHUB_TOKEN from .env and reuse it as the API key for the OpenAI-compatible endpoint below
dotenv.load_dotenv()
if not os.getenv("GITHUB_TOKEN"):
    raise ValueError("GITHUB_TOKEN is not set")
os.environ["OPENAI_API_KEY"] = os.getenv("GITHUB_TOKEN")
os.environ["OPENAI_BASE_URL"] = "https://models.inference.ai.azure.com/"

############## 1. Retrieval Phase Start #################
## A. Set up the embedding model (a neural network that maps text to vectors)
embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_key=os.getenv("OPENAI_API_KEY"),
    api_base=os.getenv("OPENAI_BASE_URL"),
)
Settings.embed_model = embed_model

## B. Load the raw documents (chunking happens in step C, inside from_documents)
documents = SimpleDirectoryReader("./logs").load_data()

## C. Chunk the documents, embed the chunks and store them in the local vector DB.
# Simplified pseudocode of what from_documents does (not the actual library source):
# def from_documents(documents, insert_batch_size=150):
#   embed_model = Settings.embed_model  # embed_model from global Settings
#   nodes = self._chunk_documents(documents)  # chunk the documents into Nodes
#   for batch in batches(nodes, batch_size=insert_batch_size):
#       texts = [node.text for node in batch]
#       embeddings = embed_model.get_text_embedding_batch(texts)
#       # store the vectors, with each node's metadata, in the vector DB
#       self._vector_store.add(embeddings, metadata=[node.metadata for node in batch])
index = VectorStoreIndex.from_documents(documents, insert_batch_size=150)
############## Retrieval Phase End #####################

# Create LLM
llm = OpenAI(
    model="gpt-4o-mini",
    api_key=os.getenv("OPENAI_API_KEY"),
    api_base=os.getenv("OPENAI_BASE_URL"),
)

############## 2,3. Augmentation & Generation Phase Start #################
# Simplified pseudocode of what the query engine does (not the actual library source):
# def query_engine(query_string: str):
#    ---- Augmentation Phase: create augmented_prompt ----
#    query_tensor = Settings.embed_model.get_text_embedding(query_string)
#    top_k_nodes = self._vector_store.similarity_search(
#       query_tensor,
#       similarity_top_k=3
#    )
#    # top_k_nodes now contains the 3 most relevant text chunks (Nodes)
#    # e.g., Node 1: "2025-05-01 08:22:47 VPN_LOGIN_FAILED user=eve.hacker ip=203.0.113.7"
#    #       Node 2: "2025-05-01 08:30:55 VPN_LOGIN_SUCCESS user=carol.white ip=192.168.1.52"
#    #       Node 3: "2025-05-01 09:01:08 VPN_LOGIN_FAILED user=john.doe ip=185.22.11.4"
#    vectordb_text = [Node1][Node2][Node3]
#    augmented_prompt =
#      "Context:
#      [Node1][Node2][Node3]
#      Question: failed vpn logins for 2 hours after 2025-05-01 09:01:08
#      Answer:"
#    ---- Generation Phase: send augmented_prompt to the LLM ----
#    return llm.complete(augmented_prompt)

query_engine = index.as_query_engine(
  llm=llm
)
response = query_engine.query("Show firewall policies blocking outbound traffic")
print(response)
# Response=
# The firewall policies blocking outbound traffic are as follows:
#
# 1. Policy: OUTBOUND_BLOCK
#    - Source: 10.1.1.5
#    - Destination: 8.8.8.8
#
# 2. Policy: OUTBOUND_BLOCK
#    - Source: 10.1.1.6
#    - Destination: 1.1.1.1

response = query_engine.query("Why is john.doe unable to connect to VPN?")
print(response)
# Response=
# john.doe is unable to connect to the VPN due to repeated login failures, as
# indicated by the log entries showing two instances of VPN_LOGIN_FAILED for the user.
############## Augmentation & Generation Phase End #################
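
To check which chunks an answer was based on, the retrieved nodes and their scores can be read back from the response object. The sketch below assumes the response exposes a source_nodes list whose items have a score and a node with get_content(), as LlamaIndex query responses do; describe_sources is a hypothetical helper name.

```python
def describe_sources(response):
    """Return one 'score | text' line per retrieved node attached to a response."""
    lines = []
    for item in response.source_nodes:
        # score may be None when the retriever does not report one
        score = "n/a" if item.score is None else f"{item.score:.2f}"
        lines.append(f"{score} | {item.node.get_content().strip()}")
    return lines


# Usage with the pipeline above (needs the index and an API key, so it is left commented out):
# response = query_engine.query("Show firewall policies blocking outbound traffic")
# for line in describe_sources(response):
#     print(line)
```

If the listed chunks do not contain the facts stated in the answer, the answer was probably not grounded in the retrieved context.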